With a focus on high-throughput sequencing data
CFAES Bioinformatics Core, OSU
2025-08-26
Both genomics and transcriptomics data are produced by high-throughput sequencing technologies.
That will be the focus of this lecture and will be used in examples throughout the course.
The shorthand “sequencing”, as in “high-throughput sequencing”, generally refers to determining the nucleotide sequence of fragments of DNA.
What about RNA or proteins?
RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing, as in nearly all “RNA-seq”.
Direct RNA sequencing is possible with one of the sequencing technologies we’ll discuss, but this is under development and not yet widely used.
Sanger sequencing (since 1977)
Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time
High-throughput sequencing (HTS, since 2005)
Sequences 10⁵ to 10⁹ DNA fragments (“reads”) at a time, usually randomly selected
Modified after Pereira et al. 2020
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
Variant analysis (for population genetics/genomics, molecular evolution, GWAS, etc.):
Whole-genome “resequencing”
Reduced-representation libraries (e.g. RADseq, GBS)
Microbial community characterization
Metabarcoding
Shotgun metagenomics
Short-read HTS
Long-read HTS
Short videos explaining the technology (90 s to 5 min each)
Short-read (Illumina) HTS: 50-300 bp reads
Long-read HTS: longer & more variable read lengths (PacBio: 10-50 kbp, ONT: 10-100+ kbp)
Genome assembly
Haplotype and large structural variant calling
Transcript isoform identification
Taxonomic identification of single reads (microbial metabarcoding)
SNP variant analysis
Read-as-a-tag: the goal is just to know a read’s origin in a reference genome, like in counting applications such as RNA-seq
Currently, no sequencing technology is error-free.
Error rates are changing
Error rates in one recent type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.
Error rates of ONT sequencing are also continuously decreasing.
Quality scores in sequence data
When you get sequences from a high-throughput sequencer, base calls have typically already been made. Every base is also accompanied by a quality score (inversely related to the estimated error probability). We’ll talk about those in some more detail in a bit.
Sequencing every base multiple times, i.e. having a so-called “depth of coverage” greater than 1x, makes it possible to infer the correct sequence.
Overcoming sequencing errors is made more challenging by natural genetic variation among and within individuals.
Typical depths of coverage: at least 50-100x for genome assembly; 10-30x for resequencing.
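As a rough illustration of how depth of coverage relates to sequencing effort, the sketch below uses the common approximation that average depth equals the total number of sequenced bases divided by the genome size. The function names and the example numbers (a 1 Gbp genome, 150 bp reads) are hypothetical and only serve to show the arithmetic.

```python
import math


def depth_of_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Return the expected average depth of coverage (30.0 means 30x)."""
    # Total sequenced bases divided by the number of bases in the genome
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Return the number of reads needed to reach a target average depth."""
    # Round up, since a fraction of a read cannot be sequenced
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: 200 million reads of 150 bp for a 1 Gbp genome
print(depth_of_coverage(200_000_000, 150, 1_000_000_000))  # 30.0
print(reads_needed(30, 150, 1_000_000_000))                # 200000000
```

This is an average only: actual depth varies along the genome, so some regions will be covered less deeply than the calculation suggests.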
We will talk a bit about Illumina library prep because this is the most common type of sequencing, and because throughout the course, we will use Illumina read files as examples.
In a sequencing context, a “library” is a collection of nucleic acid fragments ready for sequencing.
In Illumina and other HTS libraries, these fragments number in the millions or billions and are often randomly generated from input such as genomic DNA:
An overview of the library prep procedure. This is typically done for you by a sequencing facility or company.
As shown in the previous slide, after library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:
DNA fragments can be sequenced from both ends as shown below —
this is called “paired-end” (PE) sequencing:
When sequencing is instead single-end (SE), no reverse read is produced:
The size of the DNA fragment can vary – both by design and because of limited precision in size selection. In some cases, it is:
Multiplexing!
Using the indices/barcodes in adapters, up to 96 samples can be multiplexed into a single library.
Most HTS applications either require a “reference genome” or involve its production.
What exactly does “reference genome” refer to? We’ll discuss three components to this phrase:
Taxonomic identity
Typically considered at the species level, so the reference genome should ideally be that of the focal species. But:
If needed, it is often possible to work with reference genomes of closely related species
Conversely, multiple reference genomes may exist, e.g. for different subspecies
With increasing usage & quality of long-read HTS, we are generating better assemblies
For chromosome-level assemblies, i.e. with one contiguous sequence for each chromosome, technologies in addition to sequencing are often needed (e.g. Hi-C, optical mapping)
Many assemblies are not “chromosome-level”, but consist of contigs and scaffolds, often thousands of them.
Even chromosome-level assemblies are not 100% complete
Contigs are contiguous, known stretches of DNA created by the assembly process, basically by overlapping reads.
Often, the order and orientation of two or more contigs is known, but there is a gap of unknown size between them. Such contigs are connected into scaffolds with a stretch of Ns in between.
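To make the relationship between scaffolds and contigs concrete, here is a minimal sketch that recovers the contigs from a scaffold by splitting it at runs of Ns. The function name and the toy sequence are made up for illustration. Real assemblies would be read from a FASTA file, and the length of a run of Ns does not necessarily reflect the true gap size.

```python
import re


def split_scaffold(scaffold: str) -> list[str]:
    """Split a scaffold sequence into its contigs at runs of N (gap) characters."""
    # One or more consecutive Ns (upper- or lowercase) count as a single gap;
    # empty strings (from leading/trailing Ns) are dropped
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]


# Toy scaffold: three contigs joined by two gaps of unknown size
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "TTGA"
print(split_scaffold(scaffold))  # ['ACGTACGT', 'GGCCTTAA', 'TTGA']
```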
All common genetic/genomic data files are plain-text, meaning that they can be opened by any text editor. However, they are often compressed to save space. The main types are:
FASTQ
The standard format for HTS reads — contains a quality score for each nucleotide.
SAM/BAM
An alignment format for HTS reads.
FASTA files contain one or more (sometimes called multi-FASTA) DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.
As mentioned, they are versatile, and are the standard format for:
Genome assembly sequences
Transcriptomes and proteomes
Sequence downloads from NCBI such as a single gene/protein or other GenBank entry
Sequence alignments (but not from HTS reads)
The following example FASTA file contains two entries:
>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
Each entry contains a header and the sequence itself, and:
Header lines start with a > and are otherwise “free form” but usually provide an identifier (and sometimes metadata) for the sequence
FASTA file name extensions are variable:
Generic extensions are .fasta and .fa
Other extensions explicitly indicate whether sequences are nucleotide (.fna) or amino acids (.faa)
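The FASTA structure described above is simple enough to parse by hand. The following is a minimal sketch, not a replacement for an established parser: `parse_fasta` is a hypothetical helper that assumes the first word after `>` is the sequence ID, and it joins sequences that are spread over multiple lines.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a dict of {sequence ID: sequence}."""
    sequences: dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited word after ">"
            words = line[1:].split()
            current_id = words[0] if words else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # A sequence may span several lines; append each one
            sequences[current_id] += line
    return sequences


# The two-entry example FASTA file shown earlier
example = """>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
"""

records = parse_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

For compressed files, the text could first be read with Python's built-in `gzip` module.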
FASTQ is the standard format for HTS reads.
Each read forms one FASTQ entry and is represented by four lines, which contain, respectively:
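A header line that starts with @ and e.g. uniquely identifies the read
The sequence itself
A + (plus sign)
One-character quality scores for each base in the sequence
The quality scores we saw in the read on the previous slide represent an estimate of the error probability of the base call.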
Specifically, they correspond to a numeric “Phred” quality score (Q), which is a function of the estimated probability that a base call is erroneous (P):
Q = -10 * log10(P)
For some specific probabilities and their rough qualitative interpretation for Illumina data:
| Phred quality score | Error probability | Rough interpretation |
|---|---|---|
| 10 | 1 in 10 | terrible |
| 20 | 1 in 100 | bad |
| 30 | 1 in 1,000 | good |
| 40 | 1 in 10,000 | excellent |
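The conversion between Phred scores and error probabilities follows directly from the formula Q = -10 * log10(P). The sketch below, with hypothetical function names, computes it in both directions and reproduces the values in the table above.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability P."""
    # Inverse of Q = -10 * log10(P)
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P to a Phred quality score Q."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P = {p:g} (1 in {round(1 / p):,})")
```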
This numeric quality score is represented in FASTQ files not by the number itself, but by a corresponding “ASCII character”.
This allows for a single-character representation of each possible score — as a consequence, each quality score character can conveniently correspond to (& line up with) a base character in the read.
| Phred quality score | Error probability | ASCII character |
|---|---|---|
| 10 | 1 in 10 | + |
| 20 | 1 in 100 | 5 |
| 30 | 1 in 1,000 | ? |
| 40 | 1 in 10,000 | I |
A rule of thumb
In practice, you almost never have to manually check the quality scores of bases in FASTQ files, but if you do, a rule of thumb is that letter characters are good (Phred of 32 and up).
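If you do want to inspect quality scores, decoding them takes one line per read. The characters in the table above are consistent with the widely used “Phred+33” encoding, in which the Phred score is the character's ASCII code minus 33. The sketch below assumes that offset, and `decode_quality` is a hypothetical helper name.

```python
def decode_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into a list of numeric Phred scores."""
    # ord() gives the ASCII code of a character; subtracting the offset
    # (assumed to be 33 here) yields the Phred score
    return [ord(char) - offset for char in quality_string]


print(decode_quality("+5?I"))  # [10, 20, 30, 40]
print(decode_quality("A"))     # [32], the lowest-scoring letter character
```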
FASTQ files have no size limit, so you may receive a single file per sample, although:
With paired-end (PE) sequencing, forward and reverse reads are split into two files:
forward reads contain R1 and reverse reads contain R2 in the file name.
If sequencing was done on multiple “lanes”, you get separate files for each lane.
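When working with many paired-end files, it can help to match each forward file to its reverse counterpart programmatically. The sketch below assumes that file names contain an `_R1` or `_R2` token followed by an underscore or a dot. The file names and the `pair_fastq_files` helper are hypothetical, and the pattern would need adjusting for other naming schemes.

```python
import re
from collections import defaultdict


def pair_fastq_files(file_names: list[str]) -> dict[str, dict[str, str]]:
    """Group FASTQ file names into forward (R1) / reverse (R2) pairs."""
    pairs: dict[str, dict[str, str]] = defaultdict(dict)
    for name in sorted(file_names):
        # Look for "_R1" or "_R2" directly followed by "_" or "."
        match = re.search(r"_(R[12])(?=[_.])", name)
        if match is None:
            continue  # not recognizably a forward or reverse file
        # Removing the R1/R2 token gives a key shared by both mates
        key = name[:match.start()] + name[match.end():]
        pairs[key][match.group(1)] = name
    return dict(pairs)


# Hypothetical file names for two paired-end samples
files = [
    "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
    "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz",
]
for key, mates in pair_fastq_files(files).items():
    print(key, mates["R1"], mates["R2"])
```

Because the key keeps everything except the R1/R2 token, files from different lanes of the same sample would form separate pairs.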
FASTQ files have the extension .fastq or .fq (but are commonly compressed, leading to .fastq.gz etc.). All in all, having paired-end FASTQ files for 2 samples could look like this: